grant_table: convert grant table rwlock to percpu rwlock
author Malcolm Crossley <malcolm.crossley@citrix.com>
Fri, 22 Jan 2016 15:16:05 +0000 (16:16 +0100)
committer Jan Beulich <jbeulich@suse.com>
Fri, 22 Jan 2016 15:16:05 +0000 (16:16 +0100)
commit eede22972fefa02100226252c430ffcca99025eb
tree 7346cd2b3a3ae3e271478900e02d9cc161a7f841
parent ef9dd43dddc0a31a4343a58072935c1b5c0cbbee

The per-domain grant table read lock suffers from significant contention when
performing multi-queue block or network IO, due to the parallel
grant maps/unmaps/copies occurring on the DomU's grant table.

On multi-socket systems, the contention causes the locked compare-and-swap
operation to fail frequently, which results in a tight loop of retries of the
compare-and-swap operation. Because the coherency fabric can only sustain a
limited rate of compare-and-swap operations on a particular data location,
taking the read lock itself becomes a bottleneck for grant operations.
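
For illustration, the retry behaviour described above can be sketched with
C11 atomics. This is a minimal, hypothetical reader-count lock: the names
sketch_rwlock_t, sketch_read_lock and the bit layout are assumptions made for
this example and are not taken from Xen's rwlock code. Every reader performs
a compare-and-swap on the same shared word, so each successful reader
invalidates the value the others were about to swap against.

```c
#include <stdatomic.h>

/* Hypothetical layout: top bit = writer holds the lock, rest = reader count. */
#define SKETCH_WRITER_BIT 0x80000000u

typedef struct {
    atomic_uint word;   /* single shared location that all readers contend on */
} sketch_rwlock_t;

/*
 * Take the lock for reading. Returns how many compare-and-swap attempts
 * failed before one succeeded (this can include spurious failures of the
 * weak compare-and-swap, not only real contention).
 */
static unsigned sketch_read_lock(sketch_rwlock_t *l)
{
    unsigned retries = 0;
    unsigned old = atomic_load_explicit(&l->word, memory_order_relaxed);

    for (;;) {
        if (old & SKETCH_WRITER_BIT) {
            /* A writer holds the lock: re-read and wait. */
            old = atomic_load_explicit(&l->word, memory_order_relaxed);
            continue;
        }
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak_explicit(&l->word, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return retries;
        retries++;   /* another CPU changed the word first: try again */
    }
}

static void sketch_read_unlock(sketch_rwlock_t *l)
{
    atomic_fetch_sub_explicit(&l->word, 1, memory_order_release);
}
```

With many CPUs calling sketch_read_lock() at once, the retry count grows with
the number of contenders even though no writer is present, which is the
bottleneck this change targets.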

Standard rwlock performance of a single VIF VM-VM transfer with 16 queues
configured was limited to approximately 15 Gbit/s on a 2-socket Haswell-EP
host.

Percpu rwlock performance with the same configuration is approximately
48 Gbit/s.
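
The general idea behind a per-CPU read-write lock can be sketched as follows.
This is a simplified illustration under stated assumptions, not the
implementation used by this change: the type sketch_percpu_rwlock_t, the
fixed SKETCH_NR_CPUS, the 64-byte cache line size and the try-lock interface
are all hypothetical. Each reader writes only to its own CPU's cache line, so
readers do not contend with each other; a writer pays the cost of scanning
every CPU's flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4   /* assumption: small fixed CPU count for the sketch */

typedef struct {
    struct {
        /* Assumed 64-byte cache lines: one reader flag per line. */
        _Alignas(64) atomic_bool reading;
    } cpu[SKETCH_NR_CPUS];
    atomic_bool writer_active;
} sketch_percpu_rwlock_t;

/*
 * Reader fast path: publish our flag, then check for a writer.
 * Returns false if a writer is active; the caller must then retry or
 * fall back to a slower path.
 */
static bool sketch_percpu_read_trylock(sketch_percpu_rwlock_t *l, unsigned cpu)
{
    atomic_store(&l->cpu[cpu].reading, true);        /* seq_cst store */
    if (atomic_load(&l->writer_active)) {            /* seq_cst load */
        atomic_store(&l->cpu[cpu].reading, false);   /* back off */
        return false;
    }
    return true;
}

static void sketch_percpu_read_unlock(sketch_percpu_rwlock_t *l, unsigned cpu)
{
    atomic_store_explicit(&l->cpu[cpu].reading, false, memory_order_release);
}

/* Writer slow path: claim the writer flag, then wait for every reader. */
static void sketch_percpu_write_lock(sketch_percpu_rwlock_t *l)
{
    bool expected = false;

    while (!atomic_compare_exchange_weak(&l->writer_active, &expected, true))
        expected = false;   /* another writer holds it (or spurious failure) */

    for (unsigned cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        while (atomic_load(&l->cpu[cpu].reading))
            ;   /* spin until this CPU's reader has left */
}

static void sketch_percpu_write_unlock(sketch_percpu_rwlock_t *l)
{
    atomic_store(&l->writer_active, false);
}
```

The sequentially consistent store-then-load on the reader side and
set-then-scan on the writer side ensure that at least one of the two observes
the other, so a reader and a writer never both proceed. The trade-off is that
write acquisition becomes more expensive, which suits a lock that is taken
for reading far more often than for writing.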

Oprofile was used to determine the initial overhead of the read-write locks
and to confirm the overhead was dramatically reduced by the percpu rwlocks.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
xen/arch/arm/mm.c
xen/arch/x86/mm.c
xen/common/grant_table.c
xen/include/xen/grant_table.h